CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction
Given the recent advances in depth prediction from Convolutional Neural
Networks (CNNs), this paper investigates how predicted depth maps from a deep
neural network can be deployed for accurate and dense monocular reconstruction.
We propose a method where CNN-predicted dense depth maps are naturally fused
together with depth measurements obtained from direct monocular SLAM. Our
fusion scheme privileges depth prediction in image locations where monocular
SLAM approaches tend to fail, e.g. along low-textured regions, and vice-versa.
We demonstrate the use of depth prediction for estimating the absolute scale of
the reconstruction, hence overcoming one of the major limitations of monocular
SLAM. Finally, we propose a framework to efficiently fuse semantic labels,
obtained from a single frame, with dense SLAM, yielding semantically coherent
scene reconstruction from a single view. Evaluation results on two benchmark
datasets show the robustness and accuracy of our approach.
Comment: 10 pages, 6 figures, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Hawaii, USA, June 2017. The first two authors contribute equally to this paper.
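As a rough illustration of the kind of depth fusion described above (not the paper's exact scheme), the sketch below blends a dense CNN depth map with semi-dense SLAM measurements, trusting the CNN prediction more in low-texture regions where direct monocular SLAM is least reliable. The function name fuse_depth and the gradient-based confidence are assumptions made for this example.

```python
import numpy as np

def fuse_depth(cnn_depth, slam_depth, image_gray, grad_thresh=0.05):
    """Blend CNN-predicted depth with (semi-dense) SLAM depth.

    Illustrative only: weights the CNN prediction more heavily in
    low-texture regions, where direct monocular SLAM tends to fail,
    and trusts SLAM measurements where image gradients are strong.
    """
    # Image gradient magnitude as a crude texture / reliability proxy.
    gy, gx = np.gradient(image_gray.astype(np.float32))
    texture = np.sqrt(gx ** 2 + gy ** 2)

    # Confidence in the SLAM depth: high where texture is high and a
    # measurement actually exists (NaN marks missing SLAM depth).
    slam_valid = np.isfinite(slam_depth)
    slam_conf = np.clip(texture / grad_thresh, 0.0, 1.0) * slam_valid

    # Per-pixel convex combination of the two depth sources.
    fused = slam_conf * np.where(slam_valid, slam_depth, 0.0) \
            + (1.0 - slam_conf) * cnn_depth
    return fused

if __name__ == "__main__":
    h, w = 120, 160
    rng = np.random.default_rng(0)
    cnn = np.full((h, w), 2.0)          # dense CNN prediction
    slam = np.full((h, w), np.nan)      # mostly missing
    slam[::4, ::4] = 2.2                # semi-dense SLAM depth
    img = rng.random((h, w))            # stand-in texture image
    print(fuse_depth(cnn, slam, img).shape)
```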
Training-Free Layout Control with Cross-Attention Guidance
Recent diffusion-based generators can produce high-quality images from
textual prompts. However, they often disregard textual instructions that
specify the spatial layout of the composition. We propose a simple approach
that achieves robust layout control without the need for training or
fine-tuning of the image generator. Our technique manipulates the
cross-attention layers that the model uses to interface textual and visual
information and steers the generation in the desired direction given, e.g., a
user-specified layout. To determine how to best guide attention, we study the
role of attention maps and explore two alternative strategies, forward and
backward guidance. We thoroughly evaluate our approach on three benchmarks and
provide several qualitative examples and a comparative analysis of the two
strategies that demonstrate the superiority of backward guidance compared to
forward guidance, as well as prior work. We further demonstrate the versatility
of layout guidance by extending it to applications such as editing the layout
and context of real images.
Comment: WACV 2024, Project Page: https://silent-chen.github.io/layout-guidance
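The "backward guidance" idea, i.e. back-propagating a layout loss on attention maps into the latent, can be illustrated with a toy stand-in for cross-attention. The real method operates on the cross-attention layers of a pre-trained diffusion model, whereas everything here (layout_loss, the softmax "attention", the step size) is a simplified assumption.

```python
import torch

def layout_loss(attn, mask):
    """Encourage a token's cross-attention mass to fall inside a layout mask.

    attn: (H*W,) normalised attention map for one text token.
    mask: (H*W,) binary layout mask for where that token should appear.
    """
    inside = (attn * mask).sum()
    return (1.0 - inside) ** 2

def backward_guidance_step(latent, token_emb, mask, lr=0.3):
    """Toy backward-guidance step: nudge the latent so that the attention
    computed from it concentrates inside the user-specified region."""
    latent = latent.clone().requires_grad_(True)
    # Stand-in cross-attention: softmax of latent-token similarity.
    attn = torch.softmax(latent @ token_emb, dim=0)      # (H*W,)
    loss = layout_loss(attn, mask)
    loss.backward()
    with torch.no_grad():
        latent -= lr * latent.grad
    return latent.detach(), loss.item()

if __name__ == "__main__":
    hw, d = 64, 16
    torch.manual_seed(0)
    latent = torch.randn(hw, d)
    token = torch.randn(d)
    mask = torch.zeros(hw)
    mask[:16] = 1.0                                      # target region
    for _ in range(50):
        latent, loss = backward_guidance_step(latent, token, mask)
    print(f"final layout loss: {loss:.4f}")
```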
DGE: direct Gaussian 3D editing by consistent multi-view editing
We consider the problem of editing 3D objects and scenes
based on open-ended language instructions. A common approach to this
problem is to use a 2D image generator or editor to guide the 3D editing
process, obviating the need for 3D data. However, this process is often
inefficient due to the need for iterative updates of costly 3D representations, such as neural radiance fields, either through individual view edits
or score distillation sampling. A major disadvantage of this approach
is the slow convergence caused by aggregating inconsistent information
across views, as the guidance from 2D models is not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method
that addresses these issues in two stages. First, we modify a given high-quality image editor like InstructPix2Pix to be multi-view consistent. To
do so, we propose a training-free approach that integrates cues from the
3D geometry of the underlying scene. Second, given a multi-view consistent edited sequence of images, we directly and efficiently optimize the
3D representation, which is based on 3D Gaussian Splatting. Because it
avoids incremental and iterative edits, DGE is significantly more accurate and efficient than existing approaches and offers additional benefits,
such as enabling selective editing of parts of the scene.
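The second stage, directly fitting a splat-based representation to the edited views, can be sketched with a toy 2D analogue of Gaussian splatting; the renderer, resolution, and optimiser settings below are illustrative assumptions, not the authors' implementation.

```python
import torch

# Toy 2D "Gaussian splatting": the scene is a set of Gaussians with position,
# scale, and colour, rendered by additive splatting. This only illustrates
# fitting a splat-based representation directly to edited target views;
# the actual method uses 3D Gaussian Splatting and real camera views.

def render(pos, log_scale, color, hw=32):
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, hw), torch.linspace(0, 1, hw), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)          # (P, 2)
    d2 = ((grid[:, None, :] - pos[None]) ** 2).sum(-1)           # (P, G)
    w = torch.exp(-0.5 * d2 / torch.exp(log_scale) ** 2)         # (P, G)
    img = (w[..., None] * color[None]).sum(1)                    # (P, 3)
    return img.reshape(hw, hw, 3).clamp(0, 1)

if __name__ == "__main__":
    torch.manual_seed(0)
    G = 64
    pos = torch.rand(G, 2, requires_grad=True)
    log_scale = torch.full((G,), -3.0, requires_grad=True)
    color = torch.rand(G, 3, requires_grad=True)

    # Pretend this is one edited view produced by the multi-view
    # consistent image editor (stage one of the pipeline).
    target = torch.zeros(32, 32, 3)
    target[8:24, 8:24] = 0.8

    opt = torch.optim.Adam([pos, log_scale, color], lr=2e-2)
    for step in range(300):
        opt.zero_grad()
        loss = (render(pos, log_scale, color) - target).abs().mean()
        loss.backward()
        opt.step()
    print(f"L1 to edited view: {loss.item():.4f}")
```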
Measuring the Interpretability of Unsupervised Representations via Quantized Reverse Probing
Self-supervised visual representation learning has recently attracted
significant research interest. While a common way to evaluate self-supervised
representations is through transfer to various downstream tasks, we instead
investigate the problem of measuring their interpretability, i.e. understanding
the semantics encoded in raw representations. We formulate the latter as
estimating the mutual information between the representation and a space of
manually labelled concepts. To quantify this we introduce a decoding
bottleneck: information must be captured by simple predictors, mapping concepts
to clusters in representation space. This approach, which we call reverse
linear probing, provides a single number sensitive to the semanticity of the
representation. This measure is also able to detect when the representation
contains combinations of concepts (e.g., "red apple") instead of just
individual attributes ("red" and "apple" independently). Finally, we propose to
use supervised classifiers to automatically label large datasets in order to
enrich the space of concepts used for probing. We use our method to evaluate a
large number of self-supervised representations, ranking them by
interpretability, highlight the differences that emerge compared to the
standard evaluation with linear probes and discuss several qualitative
insights. Code at: https://github.com/iro-cp/ssl-qrp
Comment: Published at ICLR 2022. Appendix included, 26 pages.
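A minimal sketch of reverse probing with a quantised bottleneck might look as follows: representations are clustered, a simple predictor maps concept labels to cluster ids, and the score is an estimate of the mutual information between clusters and concepts. The cluster count, predictor choice, and toy data are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def reverse_probing_score(features, concepts, n_clusters=32, seed=0):
    """Rough sketch of reverse probing with a quantisation bottleneck.

    1. Quantise the representation space into clusters (the bottleneck).
    2. Fit a simple predictor mapping concept labels to cluster ids.
    3. Return an estimate of I(cluster; concept) in nats:
       H(cluster) minus the predictor's conditional cross-entropy.
    """
    clusters = KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init=10).fit_predict(features)

    # Entropy of the cluster marginal.
    p = np.bincount(clusters, minlength=n_clusters) / len(clusters)
    h_cluster = -np.sum(p[p > 0] * np.log(p[p > 0]))

    # Predictor from concepts (one-hot) to cluster assignments.
    concepts = np.asarray(concepts)
    x = np.eye(int(concepts.max()) + 1)[concepts]
    clf = LogisticRegression(max_iter=1000).fit(x, clusters)
    h_cond = log_loss(clusters, clf.predict_proba(x), labels=clf.classes_)

    return h_cluster - h_cond   # higher = more "semantic" representation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    concepts = rng.integers(0, 5, size=2000)
    # Toy features loosely organised by concept, plus noise.
    features = np.eye(5)[concepts] + 0.3 * rng.standard_normal((2000, 5))
    print(f"reverse-probing MI estimate: "
          f"{reverse_probing_score(features, concepts):.3f} nats")
```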
Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations
We present Neural Feature Fusion Fields (N3F), a method that improves dense
2D image feature extractors when the latter are applied to the analysis of
multiple images reconstructible as a 3D scene. Given an image feature
extractor, for example pre-trained using self-supervision, N3F uses it as a
teacher to learn a student network defined in 3D space. The 3D student network
is similar to a neural radiance field that distills said features and can be
trained with the usual differentiable rendering machinery. As a consequence,
N3F is readily applicable to most neural rendering formulations, including
vanilla NeRF and its extensions to complex dynamic scenes. We show that our
method not only enables semantic understanding in the context of scene-specific
neural fields without the use of manual labels, but also consistently improves
over the self-supervised 2D baselines. This is demonstrated by considering
various tasks, such as 2D object retrieval, 3D segmentation, and scene editing,
in diverse sequences, including long egocentric videos in the EPIC-KITCHENS
benchmark.
Comment: 3DV 2022, Oral. Project page: https://www.robots.ox.ac.uk/~vadim/n3f
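A stripped-down version of the teacher-student distillation could look like the sketch below: a small MLP acts as the 3D feature field, per-ray features are composited with given rendering weights, and an L2 loss matches them to the 2D teacher features. The network sizes and the fixed rendering weights are simplifying assumptions; the real method attaches the feature head to a NeRF-style model and uses its own rendering weights.

```python
import torch
from torch import nn

class FeatureField(nn.Module):
    """Toy 3D "student" feature field: maps 3D points to feature vectors."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))

    def forward(self, xyz):                 # (..., 3) -> (..., feat_dim)
        return self.mlp(xyz)

def render_feature(field, ray_pts, weights):
    """Composite per-point student features along each ray.

    ray_pts: (R, S, 3) sample points; weights: (R, S) rendering weights.
    """
    feats = field(ray_pts)                          # (R, S, D)
    return (weights[..., None] * feats).sum(dim=1)  # (R, D)

if __name__ == "__main__":
    torch.manual_seed(0)
    R, S, D = 256, 16, 64
    field = FeatureField(D)
    opt = torch.optim.Adam(field.parameters(), lr=1e-3)

    ray_pts = torch.rand(R, S, 3)                   # points sampled along rays
    weights = torch.softmax(torch.rand(R, S), dim=1)
    teacher = torch.randn(R, D)                     # 2D teacher features at pixels

    for step in range(100):
        opt.zero_grad()
        loss = ((render_feature(field, ray_pts, weights) - teacher) ** 2).mean()
        loss.backward()
        opt.step()
    print(f"distillation loss: {loss.item():.4f}")
```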
Diffusion models for open-vocabulary segmentation
Open-vocabulary segmentation is the task of segmenting anything that can be named in an image. Recently, large-scale vision-language
modelling has led to significant advances in open-vocabulary segmentation, but at the cost of gargantuan and increasing training and annotation
efforts. Hence, we ask if it is possible to use existing foundation models
to synthesise on-demand efficient segmentation algorithms for specific
class sets, making them applicable in an open-vocabulary setting without
the need to collect further data, annotations or perform training. To
that end, we present OVDiff, a novel method that leverages generative
text-to-image diffusion models for unsupervised open-vocabulary segmentation. OVDiff synthesises support image sets for arbitrary textual
categories, creating for each a set of prototypes representative of both
the category and its surrounding context (background). It relies solely on
pre-trained components and outputs the synthesised segmenter directly,
without training. Our approach shows strong performance on a range
of benchmarks, obtaining a lead of more than 5% over prior work on
PASCAL VOC.
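The prototype-based segmentation step (after support images have been synthesised and their features extracted) can be sketched as nearest-prototype assignment under cosine similarity. The helper names build_prototypes and segment, and the single background prototype per category, are assumptions of this sketch rather than the method's exact design.

```python
import torch
import torch.nn.functional as F

def build_prototypes(support_feats, support_masks):
    """Average features inside / outside the support masks to get one
    foreground and one background prototype for a category.

    support_feats: (N, D, H, W) features of synthesised support images.
    support_masks: (N, H, W) binary masks for the category in those images.
    """
    m = support_masks[:, None]                                   # (N,1,H,W)
    fg = (support_feats * m).sum((0, 2, 3)) / m.sum().clamp(min=1)
    bg = (support_feats * (1 - m)).sum((0, 2, 3)) / (1 - m).sum().clamp(min=1)
    return F.normalize(fg, dim=0), F.normalize(bg, dim=0)

def segment(query_feats, prototypes):
    """Assign each pixel to the most similar prototype (cosine similarity).

    query_feats: (D, H, W); prototypes: list of (D,) vectors, index 0 being
    background and the rest categories.
    """
    q = F.normalize(query_feats, dim=0)
    sims = torch.stack([(q * p[:, None, None]).sum(0) for p in prototypes])
    return sims.argmax(0)                                        # (H, W) labels

if __name__ == "__main__":
    torch.manual_seed(0)
    D, H, W = 32, 24, 24
    feats = torch.randn(4, D, H, W)      # features of synthesised supports
    masks = torch.zeros(4, H, W)
    masks[:, 6:18, 6:18] = 1.0
    fg, bg = build_prototypes(feats, masks)
    labels = segment(torch.randn(D, H, W), [bg, fg])
    print(labels.shape, labels.unique())
```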
Unsupervised multi-object segmentation by predicting probable motion patterns
We propose a new approach to learn to segment multiple image objects without manual supervision. The method can extract objects from still images, but uses videos for supervision. While prior works have considered motion for segmentation, a key insight is that, while motion can be used to identify objects, not all objects are necessarily in motion: the absence of motion does not imply the absence of objects. Hence, our model learns to predict image regions that are likely to contain motion patterns characteristic of objects moving rigidly. It does not predict specific motion, which cannot be done unambiguously from a still image, but a distribution of possible motions, which includes the possibility that an object does not move at all. We demonstrate the advantage of this approach over its deterministic counterpart and show state-of-the-art unsupervised object segmentation performance on simulated and real-world benchmarks, surpassing methods that use motion even at test time. As our approach is applicable to a variety of network architectures that segment scenes, we also apply it to existing image reconstruction-based models, showing drastic improvements. Project page and code: https://www.robots.ox.ac.uk/~vgg/research/ppmp
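A toy version of predicting a distribution of probable motions, rather than a single flow, is sketched below: a small network outputs a per-pixel Gaussian over flow and is trained with its negative log-likelihood, so high predicted variance can express "this region may move in many ways, or not at all". The architecture and loss are illustrative assumptions, not the paper's model.

```python
import torch
from torch import nn

class MotionDistributionNet(nn.Module):
    """Predicts a per-pixel Gaussian over optical flow from a single image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 4, 3, padding=1))   # 2 flow means + 2 log-variances

    def forward(self, img):
        out = self.net(img)
        mean, log_var = out[:, :2], out[:, 2:]
        return mean, log_var

def flow_nll(mean, log_var, flow):
    """Per-pixel Gaussian negative log-likelihood of the observed flow."""
    return (0.5 * (log_var + (flow - mean) ** 2 / log_var.exp())).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = MotionDistributionNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    img = torch.rand(2, 3, 32, 32)            # still frames
    flow = torch.randn(2, 2, 32, 32) * 0.1    # flow observed from video
    for step in range(50):
        opt.zero_grad()
        mean, log_var = model(img)
        loss = flow_nll(mean, log_var, flow)
        loss.backward()
        opt.step()
    print(f"NLL: {loss.item():.3f}")
```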
N2F2: hierarchical scene understanding with nested neural feature fields
Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this,
we introduce Nested Neural Feature Fields (N2F2), a novel approach that
employs hierarchical supervision to learn a single feature field, wherein
different dimensions within the same high-dimensional feature encode
scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to physical dimensions, semantics, or both, thereby enabling a comprehensive and nuanced
understanding of scenes. We leverage a 2D class-agnostic segmentation
model to provide semantically meaningful pixel groupings at arbitrary
scales in the image space, and query the CLIP vision-encoder to obtain
language-aligned embeddings for each of these segments. Our proposed
hierarchical supervision method then assigns different nested dimensions
of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine
representation. Extensive experiments show that our approach outperforms the state-of-the-art feature field distillation methods on tasks such
as open-vocabulary 3D segmentation and localization, demonstrating the
effectiveness of the learned nested feature field.
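The nested-supervision idea can be illustrated by aligning nested prefixes of one rendered feature vector with language embeddings computed at different segmentation scales. The per-scale linear heads and cosine loss below are assumptions made for this sketch rather than the exact training objective.

```python
import torch
from torch import nn
import torch.nn.functional as F

class NestedDistillationLoss(nn.Module):
    """Aligns nested prefixes of a feature vector with per-scale embeddings."""
    def __init__(self, feat_dim=256, clip_dim=512, nested_dims=(64, 128, 256)):
        super().__init__()
        assert max(nested_dims) <= feat_dim
        self.nested_dims = nested_dims
        # Illustrative per-scale heads mapping each prefix to the CLIP dim.
        self.heads = nn.ModuleList(nn.Linear(d, clip_dim) for d in nested_dims)

    def forward(self, rendered_feat, clip_targets):
        """rendered_feat: (N, feat_dim) rendered per-pixel features.
        clip_targets: list over scales of (N, clip_dim) language-aligned
        embeddings of the segment each pixel belongs to at that scale."""
        loss = 0.0
        for d, head, target in zip(self.nested_dims, self.heads, clip_targets):
            pred = F.normalize(head(rendered_feat[:, :d]), dim=-1)
            tgt = F.normalize(target, dim=-1)
            loss = loss + (1 - (pred * tgt).sum(-1)).mean()   # cosine loss
        return loss / len(self.nested_dims)

if __name__ == "__main__":
    torch.manual_seed(0)
    N = 128
    feat = torch.randn(N, 256, requires_grad=True)
    targets = [torch.randn(N, 512) for _ in range(3)]   # coarse -> fine scales
    crit = NestedDistillationLoss()
    loss = crit(feat, targets)
    loss.backward()
    print(f"nested distillation loss: {loss.item():.3f}")
```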
Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion
Instance segmentation in 3D is a challenging task due to the lack of
large-scale annotated datasets. In this paper, we show that this task can be
addressed effectively by leveraging instead 2D pre-trained models for instance
segmentation. We propose a novel approach to lift 2D segments to 3D and fuse
them by means of a neural field representation, which encourages multi-view
consistency across frames. The core of our approach is a slow-fast clustering
objective function, which is scalable and well-suited for scenes with a large
number of objects. Unlike previous approaches, our method does not require an
upper bound on the number of objects or object tracking across frames. To
demonstrate the scalability of the slow-fast clustering, we create a new
semi-realistic dataset called the Messy Rooms dataset, which features scenes
with up to 500 objects per scene. Our approach outperforms the state-of-the-art
on challenging scenes from the ScanNet, Hypersim, and Replica datasets, as well
as on our newly created Messy Rooms dataset, demonstrating the effectiveness
and scalability of our slow-fast clustering method.
Comment: NeurIPS 2023 (Spotlight). Code: https://github.com/yashbhalgat/Contrastive-Lift
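A highly simplified stand-in for the slow-fast objective is sketched below: a fast embedding head is pulled towards per-segment centroids computed from a slowly-updated EMA copy of itself. The MSE pull (with no repulsion term), the EMA rate, and the toy data are assumptions, not the paper's exact formulation.

```python
import copy
import torch
from torch import nn
import torch.nn.functional as F

def segment_centroids(emb, seg_ids, n_segs):
    """Mean embedding per 2D segment id. emb: (N, D), seg_ids: (N,)."""
    sums = torch.zeros(n_segs, emb.shape[1]).index_add_(0, seg_ids, emb)
    counts = torch.bincount(seg_ids, minlength=n_segs).clamp(min=1).unsqueeze(1)
    return sums / counts

if __name__ == "__main__":
    torch.manual_seed(0)
    fast = nn.Linear(16, 8)                  # fast embedding head
    slow = copy.deepcopy(fast)               # slow (EMA) copy, no gradients
    for p in slow.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(fast.parameters(), lr=1e-2)

    feats = torch.randn(512, 16)             # per-pixel input features
    seg_ids = torch.randint(0, 6, (512,))    # 2D instance masks for one frame

    for step in range(200):
        opt.zero_grad()
        with torch.no_grad():
            targets = segment_centroids(slow(feats), seg_ids, 6)
        pred = fast(feats)
        # Pull each pixel towards its own segment's slow centroid.
        loss = F.mse_loss(pred, targets[seg_ids])
        loss.backward()
        opt.step()
        # EMA update of the slow head.
        with torch.no_grad():
            for ps, pf in zip(slow.parameters(), fast.parameters()):
                ps.mul_(0.99).add_(0.01 * pf)
    print(f"slow-fast loss: {loss.item():.4f}")
```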
IM-3D: iterative multiview diffusion and reconstruction for high-quality 3D generation
Most text-to-3D generators build upon off-the-shelf text-to-image models trained on billions of images. They use variants of Score Distillation Sampling (SDS), which is slow, somewhat unstable, and prone to artifacts. A mitigation is to fine-tune the 2D generator to be multi-view aware, which can help distillation or can be combined with reconstruction networks to output 3D objects directly. In this paper, we further explore the design space of text-to-3D models. We significantly improve multi-view generation by considering video instead of image generators. Combined with a 3D reconstruction algorithm which, by using Gaussian splatting, can optimize a robust image-based loss, we directly produce high-quality 3D outputs from the generated views. Our new method, IM-3D, reduces the number of evaluations of the 2D generator network 10-100x, resulting in a much more efficient pipeline, better quality, fewer geometric inconsistencies, and higher yield of usable 3D assets.